Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

Mapping of Sequence Reads to the Reference Genomes ◾ 77

algorithm uses an uncompressed suffix array (SA) data structure to perform a sequential maximum mappable seed search. The developer defines a maximum mappable seed as the longest substring of a read that exactly matches one or more substrings of the reference genome, and the search proceeds by mapping such seeds to the reference genome one after another. A read that spans a splice junction cannot be mapped contiguously: the first seed is mapped up to a donor splice site, and the search is then repeated on the unmapped portion of the read, which is mapped starting from an acceptor splice site. The search is performed in both the forward and reverse directions. This kind of search also helps in the detection of base mismatches and InDels: if one or more mismatches are found, the matched substrings act as anchors on the genome from which the alignment can be extended. The search is followed by clustering of the seeds by proximity to determine the anchor seeds. The aligned seeds that lie within a user-defined window around the anchor seeds are then stitched together using dynamic programming. STAR is capable of detecting splices and chimeric transcripts and of mapping complete RNA transcripts that are formed from non-contiguous exons in eukaryotes [8].
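The sequential seed search described above can be illustrated with a small Python sketch. This is not STAR's implementation: the function names (build_suffix_array, maximal_mappable_prefix, sequential_seed_search) and the toy genome are assumptions made for illustration, the suffix array is built naively, and only exact matching in the forward direction is shown (no mismatches, clustering, or stitching).

```python
def build_suffix_array(genome):
    """Naive uncompressed suffix array: suffix start positions in sorted order."""
    return sorted(range(len(genome)), key=lambda i: genome[i:])


def _char_at(genome, pos, offset):
    """Character at genome[pos + offset]; '' (sorts first) past the end."""
    i = pos + offset
    return genome[i] if i < len(genome) else ""


def _bound(genome, sa, lo, hi, offset, c, upper):
    """Binary search inside SA interval [lo, hi) on the character at `offset`."""
    while lo < hi:
        mid = (lo + hi) // 2
        ch = _char_at(genome, sa[mid], offset)
        if ch < c or (upper and ch == c):
            lo = mid + 1
        else:
            hi = mid
    return lo


def maximal_mappable_prefix(genome, sa, query):
    """Longest prefix of `query` that occurs exactly in `genome`.

    Returns (length, sorted genome positions of every exact occurrence).
    """
    lo, hi, k = 0, len(sa), 0
    while k < len(query):
        new_lo = _bound(genome, sa, lo, hi, k, query[k], upper=False)
        new_hi = _bound(genome, sa, lo, hi, k, query[k], upper=True)
        if new_lo == new_hi:  # no suffix extends the match by query[k]
            break
        lo, hi, k = new_lo, new_hi, k + 1
    positions = sorted(sa[lo:hi]) if k > 0 else []
    return k, positions


def sequential_seed_search(genome, read):
    """Repeat the prefix search on the still-unmapped part of the read."""
    sa = build_suffix_array(genome)
    seeds, start = [], 0
    while start < len(read):
        length, positions = maximal_mappable_prefix(genome, sa, read[start:])
        if length == 0:
            start += 1  # simplification: skip a base that matches nowhere
            continue
        seeds.append((start, length, positions))
        start += length
    return seeds


# Toy example (invented sequences): two exons separated by a GT...AG intron.
exon1, intron, exon2 = "ACGTTGCA", "GTAAGTCCCCTTTTAG", "TCGGATCA"
genome = "GG" + exon1 + intron + exon2 + "CC"
read = exon1 + exon2  # a spliced read covering the exon-exon junction

seeds = sequential_seed_search(genome, read)
for read_start, length, positions in seeds:
    print(f"read offset {read_start}: seed length {length}, genome positions {positions}")
```

On this toy input the read splits into two seeds: the first ends where the intron begins (the donor side), and the second starts immediately after the intron (the acceptor side). In real data a seed may extend a few bases past the true junction by chance, which is one reason the later clustering and stitching steps are needed.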

The STAR software can be installed by following the installation instructions, which are available at “https://github.com/alexdobin/STAR”. On Ubuntu, you can install STAR using the following command:

sudo apt install rna-star

As with most read aligners, the basic STAR workflow consists of index generation followed by read alignment. For index generation, however, STAR takes both a reference genome in FASTA format and a reference annotation file in GTF format; the annotation is technically optional, but it is strongly recommended because it lets STAR include known splice junctions in the index. Pre-built indexes for the genomes of some species can be downloaded from the STAR official website. As discussed before, reference genomes can be downloaded from databases such as NCBI Assembly, the UCSC genome collection, or any other database. For the aligners discussed before, we downloaded the human reference genome from the NCBI Genome database. For STAR, we will download the human reference genome and its GTF annotation file from the UCSC database, because UCSC maintains the gene annotation file in GTF format. Use the following commands to create a new directory “ucscref” and then download and decompress the human reference genome and the GTF annotation file:

mkdir ucscref

# Download the hg38 reference genome (FASTA)
wget \
  -O "ucscref/hg38.fa.gz" \
  https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/latest/hg38.fa.gz

# Download the NCBI RefSeq gene annotation (GTF)
wget \
  -O "ucscref/hg38.ncbiRefSeq.gtf.gz" \
  https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.ncbiRefSeq.gtf.gz

# Decompress both files
gzip -d ucscref/hg38.fa.gz
gzip -d ucscref/hg38.ncbiRefSeq.gtf.gz

Then, we will build the index for the reference genome using the “STAR” command.